Analyzing Vehicular Accidents in Maryland

By: Ashlynn Grawehr

[Image: 210503-ocean-city-car-crash-ew-1126p.jpg]

Pictured above is a car crash on the Bay Bridge in Maryland.

Introduction

To demonstrate the "Data Science Pipeline", I am using the State of Maryland's vehicular crash data. It should come as no surprise that driving is one of the most dangerous forms of transportation. It's estimated that 2 out of 3 drivers will be involved in an injury accident (i.e., a vehicular accident that causes physical injury) during their lifetime. Aggressive driving and speeding are to blame for most car accidents. Understanding and promoting awareness of the real dangers of driving is incredibly important. Driving is something that many of us take for granted, but it can become very dangerous very quickly.

Sources: https://www.drive-safely.net/driving-statistics/ and https://www.nbcnews.com/news/us-news/heroic-car-crash-witness-saves-toddler-who-was-ejected-maryland-n1266212

Data Collection

Source

After looking through many data sets on https://opendata.maryland.gov/, I eventually settled on Maryland Statewide Vehicle Crashes. It covers January 2015 through December 2020 for all the counties in Maryland. Not only is it relatively recent, but it also gave me a lot to work with. The site has data sets for just about everything related to Maryland, though! I got my data from https://opendata.maryland.gov/Public-Safety/Maryland-Statewide-Vehicle-Crashes/65du-s3qu. There is a handy link to download the CSV file, and the website has a visualization feature too!
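As a minimal sketch of this step, the downloaded CSV can be read into a pandas dataframe. The function name `load_crashes` and the local file name in the comment are my own placeholders, not names from the data portal.

```python
import pandas as pd


def load_crashes(csv_path):
    """Read the downloaded crash CSV into a dataframe.

    With a large file, pandas may emit a DtypeWarning for columns whose
    values look like a mix of types (for example numbers and text).
    """
    return pd.read_csv(csv_path)


# Example call (hypothetical local file name):
# crashes = load_crashes("Maryland_Statewide_Vehicle_Crashes.csv")
# print(crashes.shape)
```

The function only wraps `pd.read_csv`, so any of its usual options (such as `usecols` or `dtype`) could be passed through if needed.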

Libraries

Data Tidying

Data is sometimes read messily into a dataframe, and I certainly did not need all 56 columns that my data set provided. I was able to get rid of most of them, including columns 34 and 46 (which were eliciting a warning above). Most of the dropped columns contained numerical codes for the descriptions, and since the key was not provided, the numbers were virtually useless. I ended up keeping:

The column names were changed to make them easier to type, to take up less space, and because I did not like the originals. They all have the same meanings as before; they are just a little more concise now.
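Selecting and renaming columns could look roughly like the sketch below. The original column names on the left and the short names on the right are assumptions for illustration (only the short name Date is mentioned in this write-up), so they should be checked against the real CSV header.

```python
import pandas as pd

# Hypothetical mapping: original column name -> shorter name.
RENAMES = {
    "ACC_DATE": "Date",
    "ACC_TIME": "Time",
    "LIGHT_DESC": "Light",
    "SURF_COND_DESC": "Surface",
    "REPORT_TYPE": "ReportType",
    "COUNTY_DESC": "County",
    "LATITUDE": "Lat",
    "LONGITUDE": "Lon",
}


def tidy_columns(df, renames=RENAMES):
    """Keep only the mapped columns that exist in df, then rename them."""
    keep = [col for col in renames if col in df.columns]
    return df[keep].rename(columns=renames)
```

Filtering on `df.columns` first means the function does not fail if one of the assumed names is missing; it simply skips that column.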

After renaming columns, some new columns had to be added (for convenience later). I added a datetime object, built from the Date column in the dataframe. I also added a DaysSinceStart counter. I figured it would give me a little more to play around with later on, so that I'm not solely relying on the datetime object. I also added a Y/NSevere column, which holds a 1 if the crash was classified as Injury Crash or Fatal Crash and a 0 otherwise. Lastly, I added a %Light column to give the categorical lighting variable a numerical scale. I classified Daylight as 1.0, Dawn/Dusk as .5, Dark Lights On as .25, and everything else as 0 (as it's in the darkness).
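The derived columns described above can be sketched as follows. The input column names (Date, ReportType, Light), the new DateTime name, the exact category strings, and the assumption that dates parse without an explicit format are all placeholders of mine; the scores and the severe/non-severe rule come from the paragraph above.

```python
import pandas as pd

# Lighting scores from the text; the exact label spellings are assumed.
LIGHT_SCORES = {
    "Daylight": 1.0,
    "Dawn": 0.5,
    "Dusk": 0.5,
    "Dark Lights On": 0.25,
}
SEVERE_TYPES = ["Injury Crash", "Fatal Crash"]


def add_derived_columns(df):
    """Return a copy of df with the four derived columns added."""
    out = df.copy()
    # Parse the Date column; pass format=... if the dates are stored compactly.
    out["DateTime"] = pd.to_datetime(out["Date"])
    # Whole days elapsed since the earliest crash in the data.
    out["DaysSinceStart"] = (out["DateTime"] - out["DateTime"].min()).dt.days
    # 1 for injury or fatal crashes, 0 for everything else.
    out["Y/NSevere"] = out["ReportType"].isin(SEVERE_TYPES).astype(int)
    # Any lighting label not listed in LIGHT_SCORES is scored as 0.
    out["%Light"] = out["Light"].map(LIGHT_SCORES).fillna(0.0)
    return out
```

Working on a copy keeps the original dataframe untouched, which makes it easier to rerun the step while experimenting.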

I also decided to drop any rows that were missing data. My data set has 666K rows, so I could lose a few without impacting the results. There is also no meaningful average to fill in for a missing categorical value, so it made more sense to just exclude those rows.
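A small sketch of dropping incomplete rows while keeping track of how many were lost. The helper name `drop_incomplete_rows` is hypothetical.

```python
import pandas as pd


def drop_incomplete_rows(df):
    """Drop rows with any missing value; return (cleaned_df, rows_dropped)."""
    before = len(df)
    cleaned = df.dropna().reset_index(drop=True)
    return cleaned, before - len(cleaned)
```

Returning the number of dropped rows makes it easy to confirm that only a small fraction of the data set was removed.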

Data Management, Representation, and Analysis

In order to try to identify trends or patterns, the data will be visualized in maps and graphs.

To get started, I decided to look at the yearly trends. After grouping all the data by year and plotting it, it is pretty clear that something changed around 2018. After reaching its peak, the number of crashes dropped drastically, by about 25,000, in 2020. I felt the need to break this down more to see what's really going on.
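The yearly grouping can be sketched like this, assuming the dataframe has a parsed datetime column (called DateTime here, which is a placeholder name).

```python
import pandas as pd


def crashes_per_year(df, datetime_col="DateTime"):
    """Count crashes in each calendar year, indexed by year."""
    return df.groupby(df[datetime_col].dt.year).size()


# With matplotlib installed, the counts could be plotted with:
# crashes_per_year(crashes).plot(marker="o")
```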

After breaking things down into quarters, the trends continue. The largest share of crashes occurred in the fourth quarter (October through December). Perhaps this is due to inclement weather? Perhaps to the many holidays? A further breakdown is needed to understand why.
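One way to compare quarters is to compute each quarter's share of all crashes. This is only a sketch, again assuming a parsed datetime column with the placeholder name DateTime.

```python
import pandas as pd


def quarter_shares(df, datetime_col="DateTime"):
    """Fraction of all crashes that fall in each calendar quarter (1 to 4)."""
    quarters = df[datetime_col].dt.quarter
    return quarters.value_counts(normalize=True).sort_index()
```

Shares are easier to compare than raw counts here, since a quarter would hold 25% of crashes if they were spread evenly over the year.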

Not too surprisingly, most crashes appear to occur in the middle of the day, when the most people are on the roads. There is probably a lot more aggressive driving and speeding when more people are out on the roads. And, as mentioned in the introduction, those are the top two reasons for crashes. I'd like to have a look at some other variables, though.
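A sketch of counting crashes by hour of day. It assumes a Time column holding strings such as "14:05:00"; both the column name and the time format are assumptions and may need adjusting to the real data.

```python
import pandas as pd


def crashes_per_hour(df, time_col="Time"):
    """Count crashes for each hour 0 to 23; unparseable times are ignored."""
    times = pd.to_datetime(df[time_col], format="%H:%M:%S", errors="coerce")
    hours = times.dropna().dt.hour
    # Reindex so hours with no crashes still appear, with a count of 0.
    return hours.value_counts().reindex(range(24), fill_value=0)
```

Filling in the empty hours keeps the line graph continuous instead of skipping hours that happen to have no rows.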

This is where things start to get a little confusing. I had originally thought that most car accidents would occur at night (or at least at dusk or dawn). However, the previous line graph (crashes by time of day) is consistent with this pie chart: the majority of crashes occur in daylight.

Even more surprising (to me) than the amount of light is the condition of the roads. For the majority of crashes, the roads appear to have been dry. I would have thought that wet conditions would lead to more crashes (and that's what they emphasized in Driver's Ed., too!).

Luckily, only a small number of crashes were fatal from 2015 to 2020! The majority appear to have been property damage crashes, which is understandable.
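The lighting, road surface, and crash type breakdowns above all come down to the share of rows in each category. A minimal sketch, with a hypothetical helper name and example column name:

```python
import pandas as pd


def category_percentages(df, column):
    """Percentage of rows in each category of column, largest first."""
    shares = df[column].value_counts(normalize=True) * 100
    return shares.round(1)


# With matplotlib installed, a pie chart could be drawn with:
# category_percentages(crashes, "Surface").plot.pie(autopct="%.1f%%")
```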

When plotting on a map, it is important to remember not to over-generalize. With too few colors, or too few distinguishing icons, it can be difficult to visualize what's going on (which defeats the whole purpose). That being said, the opposite holds too: too many can also be counterproductive. I thought that 23 different colors (for the 23 different counties) would be too much, so I decided to color-code the five most populated counties in Maryland. They were (in order) Montgomery County, Prince George's County, Baltimore County, Baltimore City, and Anne Arundel County. Technically speaking, Baltimore City was not part of the top five, but I had differentiated all of Baltimore City's surrounding counties and did not want it to get lost (so I gave it its own color along with the other populated counties). The majority of crashes appear to occur in these heavily populated areas.
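The color-coding rule can be sketched as a simple lookup with a shared fallback color. The specific colors and the county label spellings below are assumptions for illustration; the labels in the data may differ.

```python
# Hypothetical colors for the highlighted areas named in the text.
HIGHLIGHT_COLORS = {
    "Montgomery": "red",
    "Prince George's": "blue",
    "Baltimore": "green",
    "Baltimore City": "purple",
    "Anne Arundel": "orange",
}
DEFAULT_COLOR = "gray"  # every other county shares one color


def county_color(county):
    """Return the marker color for a county label."""
    return HIGHLIGHT_COLORS.get(county.strip(), DEFAULT_COLOR)
```

The returned color name could then be passed to whichever mapping library draws the markers, so the map needs six colors in total rather than one per county.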

Conclusion

While there is never one simple answer to issues like this, we can certainly see some related factors. The time and location of the crashes in Maryland are pretty understandable, given external factors. When more people are out on the roads, you are more likely to get into a car accident. Even if you are incredibly safe, you cannot always trust others. This is also shown in the daylight/darkness chart: the majority of crashes occur when it is light out. It also makes sense that there will be more crashes in more heavily populated areas. Baltimore County and Baltimore City definitely make sense, since there is so much traffic crammed into a (relatively) small area. I was particularly shocked by the road conditions. I really thought that the majority of accidents would happen when the roads are wet and slick, but I stand corrected. This reinforces the importance of reducing distracted driving. If external factors are not causing people to get into crashes, then it is likely the driver. It also makes perfect sense that the majority of crashes involve only property damage. Ultimately, even if nobody is hurt, it is likely that the car(s) involved will have some sort of damage.

It would be interesting to see if this trend continues into 2021 and beyond. Looking at the yearly line graph, it appears that the number of crashes may continue to decrease. A culture change has likely helped with this, too. It's no longer "cool" to text and drive, and many people have begun to call out their friends and family when they see it. It will be interesting to see if that plays any role in the (possible) reduction in crashes next year.

If you would like more information on car crashes in Maryland, or information on the dangers of driving, these resources may be a good starting point.